The Infona portal uses cookies, i.e. strings of text saved by a browser on the user's device. The portal can access those files and use them to remember the user's data, such as their chosen settings (screen view, interface language, etc.), or their login data. By using the Infona portal the user accepts automatic saving and using this information for portal operation purposes. More information on the subject can be found in the Privacy Policy and Terms of Service. By closing this window the user confirms that they have read the information on cookie usage, and they accept the privacy policy and the way cookies are used by the portal. You can change the cookie settings in your browser.
The computer industry has transitioned into multi-core and many-core parallel systems. The CUDA programming environment from NVIDIA is an attempt to make programming many-core GPUs more accessible to programmers. However, there are still many burdens placed upon the programmer to maximize performance when using CUDA. One such burden is dealing with the complex memory hierarchy. Efficient and correct...
CUDA is a data parallel programming model that supports several key abstractions - thread blocks, hierarchical memory and barrier synchronization - for writing applications. This model has proven effective in programming GPUs. In this paper we describe a framework called MCUDA, which allows CUDA programs to be executed efficiently on shared memory, multi-core CPUs. Our framework consists of a set...
Ease of programming is one of the main impediments for the broad acceptance of multi-core systems with no hardware support for transparent data transfer between local and global memories. Software cache is a robust approach to provide the user with a transparent view of the memory architecture; but this software approach can suffer from poor performance. In this paper, we propose a hierarchical, hybrid...
Set sharing is an abstract domain in which each concrete object is represented by the set of local variables from which it might be reachable. It is a useful abstraction to detect parallelism opportunities, since it contains definite information about which variables do not share in memory, i.e., about when the memory regions reachable from those variables are disjoint. Set sharing is a more precise...
Demand for instruction level parallelism calls for increasing register bandwidth without increasing the number of register ports. Emerging architectures address this need by partitioning registers into multiple distributed banks, which offers a technology scalable substrate but a challenging compilation target. This paper introduces a register allocator for spatially partitioned architectures. The...
Partial differential equation solvers spend most of their computation time performing nearest neighbor (stencil) computations on grids that model spatial domains. Tiling is an effective performance optimization for improving the data locality and enabling course-grain parallelization for such computations. However, when the domains are periodic, tiling through time is not directly applicable due to...
Dependence information between program values is extensively used in many program optimization techniques. The ability to identify statements, calls and loop iterations that do not depend on each other enables many transformations which increase the instruction and thread-level parallelism in a program. When program variables contain complex data structures including arrays, records, and recursive...
Irregular applications, i.e., programs that manipulate pointer-based data structures such as graphs and trees, constitute a challenging target for parallelization because the amount of parallelism is input dependent and changes dynamically. Traditional dependence analysis techniques are too conservative to expose this parallelism. Even manual parallelization is difficult, time consuming, and error...
Certain high-performance applications like multimedia and gaming have performance requirements beyond reducing program execution time. These applications have repetitive components whose desired performance characteristics are more naturally expressed using soft real-time theory with its probabilistic guarantees. However, for large complex gaming and multimedia applications, programmers typically...
In this paper we propose a lock assignment technique to simplify the mutual exclusion enforcement in multithreaded programs. Programmers are allowed to annotate the regions of code that are expected to be mutually exclusive as critical sections, without using explicit locks. The compiler then automatically infers an assignment of the minimum number of locks to critical sections by solving the Minimum...
The move to multi-core has increased interest in parallelizing sequential programs. Classical dependency-based techniques, although successful for some classes of programs, often fail due to the one-sided (conservative) approximation of program behavior. Thread-level speculation enables increased parallelism by allowing out-of-order execution: correct dependences are ensured by run-time monitoring...
This paper describes a safety analysis for a multithreaded system based upon transactional memory. The analysis guarantees that shared data is always read and written from within a transaction, while allowing for unsynchronized access to thread-local and (shared) read-only data, as well as the migration of data between threads. The analysis is based on a type and effect system for object-oriented...
The increasing complexity of computer architectures has made the approach of automatically generating code that is optimized for the target machine a growing area of interest. Examples of such systems are library generators, such as ATLAS, SPIRAL, and FFTW. To generate optimized code without manual intervention, these systems need to know the values of certain hardware parameters, such as the cache...
As memory hierarchy becomes deeper and shared by more processors, locality increasingly determines system performance. As a rigorous and precise locality model, reuse distance has been used in program optimizations, performance prediction, memory disambiguation, and locality phase prediction. However, the high cost of measurement has been severely impeding its uses in scenarios requiring high efficiency,...
As the amount of on-chip cache increases as a result of Moore’s law, cache utilization is increasingly important as the number of processor cores multiply and the contention for memory bandwidth becomes more severe. Optimal cache management requires knowing the future access sequence and being able to communicate this information to hardware. The paper addresses the communication problem with two...
As hardware systems move toward multicore and multithreaded architectures, programmers increasingly rely on automated tools to help with both the parallelization of legacy codes and effective exploitation of all available hardware resources. Thread-level speculation (TLS) has been proposed as a technique to parallelize the execution of serial codes or serial sections of parallel codes. One of the...
Problem-solving frameworks in large-scale and wide-area environments must handle connectivity issues (NATs and firewalls), maintain scalability with respect to connection management, accommodate dynamic processes joining/leaving at runtime, and provide simple means to tolerate communication/node failures. All of the above must be presented in a simple and flexible programming model. This paper designs...
Compacting garbage collector (GC) is widely used due to its good properties of in-place collection and heap de-fragmentation. In addition, it supports fast bump-pointer allocation and provides good access locality. Most known commercial JVM or CLR implementations use compaction algorithm in certain garbage collection scenarios, such as in full heap or mature object space collections. LISP2 compactor...
Programming paradigms are designed to express algorithms elegantly and efficiently. There are many parallel programming paradigms, each suited to a certain class of problems. Selecting the best parallel programming paradigm for a problem minimizes programming effort and maximizes performance. Given the increasing complexity of parallel applications, no one paradigm may be suitable for all components...
Conventional iterative solvers for partial differential equations impose strict data dependencies between each solution point and its neighbors. When implemented in OpenMP, they repeatedly execute barrier synchronization in each iterative step to ensure that data dependencies are strictly satisfied. We propose new parallel annotations to support an asynchronous computation model for iterative solvers...
Set the date range to filter the displayed results. You can set a starting date, ending date or both. You can enter the dates manually or choose them from the calendar.